Abstract

The problem of finding a longest common subsequence of two main sequences with
some constraint that must be a substring of the result (STR-IC-LCS) was formulated
recently. It is a variant of the constrained longest common subsequence problem.
As the known algorithms for the STR-IC-LCS problem are cubic-time, the presented
quadratic-time algorithm is significantly faster.

1 Introduction

One of the most popular ways of measuring sequence similarity is computation of their
longest common subsequence (LCS) [7], in which we are interested in a subsequence that is
common to all sequences and has the maximal possible length. It is well known that for two
sequences of length $n$ and $m$ an LCS can be found in $O(nm)$ time, which is a lower bound of
time complexity in the comparison-based computing model for this problem [1]. In the more
practical, RAM model of computations, the asymptotically fastest algorithm is the one by
Masek and Paterson, which runs in $O(nm / \log n)$ time for bounded and
$O(mn \log\log n / \log n)$ time for unbounded alphabet [8].

One family of LCS-related problems considers one or more constraining sequences, which
(in some variants) must be included in, or (in other variants) are forbidden as part of,
the resulting sequence [3, 9]. The motivation for these generalizations came from
bioinformatics, in which some prior knowledge is often available and one can specify some
requirements on the result [9, 5].

In this work, we consider the problem called STR-IC-LCS, introduced in [3], in which a
constraining sequence of length $r$ must be included as a substring of a common subsequence
of two main sequences and the length of the result must be maximal. In [3] an $O(nmr)$-time
algorithm was given for it. Farhana et al. [6] proposed finite-automata-based algorithms for
the STR-IC-LCS, CLCS, and two other problems defined by Chen and Chao [3]. The authors
claim that the algorithms work in $O(r(n + m) + (n + m) \log(n + m))$ time in the worst case.
This would be a breakthrough, as it would also mean that the LCS problem could be solved in
$O(n \log n)$ time. Unfortunately, the time complexity analysis is based on the claim from [2]
that a directed acyclic subsequence graph (DASG) for two sequences of lengths $n$ and $m$
contains $O(n + m)$ states and can be built in $O((n + m) \log(n + m))$ time. As was shown
by Crochemore et al. [4], this result was wrong and such a DASG contains $\Omega(nm)$ states in
the worst case, so its construction time cannot be lower. Thus, the algorithms by Farhana
et al. [6] work in $\Omega(nmr)$ time for the variants of the CLCS problem and $\Omega(nm)$ for the
LCS problem. Moreover, these complexities hold under the assumption that the alphabet size is
constant; otherwise they should be multiplied by its size.

In this paper, we propose the first quadratic-time algorithm for the STR-IC-LCS problem
and also show further possible improvements of the time complexity. We also present how
this algorithm can be extended to many main input sequences.

The paper is organized as follows. In Section 2 some definitions are given and the
problem is formally stated. Section 3 describes our algorithm. Extension to the case of
many main sequences and some improvements of the algorithm are given in Section 4. The
last section concludes.

2 Definitions 

Let us have two main sequences $A = a_1 a_2 \cdots a_n$ and $B = b_1 b_2 \cdots b_m$ and one
constraining sequence $P = p_1 p_2 \cdots p_r$. W.l.o.g. we can assume that $r \le m \le n$.
Each sequence is composed of symbols from alphabet $\Sigma$ of size $\sigma$. The length (or
size) of any sequence $X$ is the number of elements it is composed of and is denoted as $|X|$.
A sequence $X^*$ is a subsequence of $X$ if it can be obtained from $X$ by removing zero or
more symbols. The LCS problem for $A$ and $B$ is to find a subsequence $C$ of both $A$ and $B$
of the maximal possible length. The LCS length for $A$ and $B$ is denoted by $LLCS(A, B)$.
A sequence $\beta$ is a substring of $X$ if $X = \alpha\beta\gamma$ for some, possibly empty,
sequences $\alpha$, $\beta$, $\gamma$. An appearance of sequence $X = x_1 x_2 \ldots x_{|X|}$ in
sequence $Y = y_1 y_2 \ldots y_{|Y|}$ starting at position $j$ is a sequence of indexes
$i_1, i_2, \ldots, i_{|X|}$ such that $i_1 = j$ and $X = y_{i_1} \ldots y_{i_{|X|}}$. A compact
appearance of a sequence $X$ in $Y$ starting at position $j$ is the appearance with the
smallest last index, $i_{|X|}$. A match for sequences $A$ and $B$ is a pair $(i, j)$ such that
$a_i = b_j$. The total number of matches for $A$ and $B$ is denoted by $d$. It is obvious
that $d \le mn$.
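
The notion of a compact appearance can be illustrated with a short Python sketch. It is not part of the original presentation; the function name `compact_appearance` and the use of Python strings for sequences are assumptions made only for this example. The sketch relies on the observation that greedily taking the earliest later occurrence of each next symbol minimizes the last index.

```python
def compact_appearance(x, y, j):
    """Return the 1-based indexes of the compact appearance of x in y
    starting at position j, or None if no appearance of x starts at j."""
    if not x or j < 1 or j > len(y) or y[j - 1] != x[0]:
        return None
    indexes = [j]
    pos = j  # 1-based position of the last matched symbol
    for sym in x[1:]:
        # str.find searches from 0-based index pos, i.e. strictly after
        # the 1-based position pos; the earliest hit keeps the end minimal.
        nxt = y.find(sym, pos)
        if nxt == -1:
            return None
        pos = nxt + 1
        indexes.append(pos)
    return indexes


if __name__ == "__main__":
    print(compact_appearance("ab", "aabab", 1))  # [1, 3]
    print(compact_appearance("ab", "aabab", 4))  # [4, 5]
```

For $X = ab$ and $Y = aabab$, the appearance starting at position 1 could also end at position 5, but the compact one ends at position 3.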

The STR-IC-LCS problem for the main sequences $A$, $B$, and the constraining sequence $P$
is to find a subsequence $C$ of both $A$ and $B$ of the maximal possible length containing $P$
as its substring. (In the CLCS problem, $P$ must be a subsequence of $C$.)
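
To make the problem statement concrete, the following exhaustive Python sketch checks the definition directly. It is exponential in the length of the shorter sequence and is meant only as a reference for tiny inputs, not as the algorithm of this paper; the names `is_subsequence` and `str_ic_lcs_bruteforce` are hypothetical.

```python
from itertools import combinations


def is_subsequence(c, x):
    """True if c can be obtained from x by removing zero or more symbols."""
    it = iter(x)
    return all(sym in it for sym in c)


def str_ic_lcs_bruteforce(a, b, p):
    """Return a longest common subsequence of a and b that contains p as a
    substring, or None if no such common subsequence exists."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    # Try candidate lengths from the longest down to len(p).
    for k in range(len(short), len(p) - 1, -1):
        for positions in combinations(range(len(short)), k):
            c = "".join(short[i] for i in positions)
            if p in c and is_subsequence(c, long_):
                return c
    return None


if __name__ == "__main__":
    result = str_ic_lcs_bruteforce("abcbdab", "bdcaba", "ab")
    print(len(result), result)
```

For $A = abcbdab$, $B = bdcaba$, and $P = ab$ the returned sequence has length 4; for $P = cc$ no solution exists, since $A$ contains only one symbol $c$.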

3 The algorithm 

The algorithm we propose is based on dynamic programming with some preprocessing. To
show its correctness it is necessary to prove some lemmas.

Let $C = c_1 c_2 \ldots c_\ell$ be a longest common subsequence with substring constraint for
$A$, $B$, and $P$. Let also $I = (i_1, j_1), (i_2, j_2), \ldots, (i_\ell, j_\ell)$ be a sequence of
indexes of $C$ symbols in $A$ and $B$, i.e., $C = a_{i_1} a_{i_2} \cdots a_{i_\ell}$ and
$C = b_{j_1} b_{j_2} \cdots b_{j_\ell}$. From the problem statement, there must exist such
$q \in [1, \ell - r + 1]$ that $P = a_{i_q} a_{i_{q+1}} \cdots a_{i_{q+r-1}}$ and
$P = b_{j_q} b_{j_{q+1}} \cdots b_{j_{q+r-1}}$.

Lemma 1. Let $i'_q = i_q$ and for all $t \in [1, r - 1]$, let $i'_{q+t}$ be the smallest possible,
but larger than $i'_{q+t-1}$, index in $A$ such that $a_{i_{q+t}} = a_{i'_{q+t}}$. The sequence of
indexes $I' = (i_1, j_1), \ldots, (i_{q-1}, j_{q-1}), (i'_q, j_q), (i'_{q+1}, j_{q+1}), \ldots,
(i'_{q+r-1}, j_{q+r-1}), (i_{q+r}, j_{q+r}), \ldots, (i_\ell, j_\ell)$ defines a longest common
subsequence $C'$ of $A$ and $B$ with string constraint $P$, equal to $C$.

Proof. From the definition of indexes $i'_{q+t}$ it is obvious that they form an increasing
sequence, since $i'_q = i_q$ and $i'_{q+r-1} \le i_{q+r-1}$. The sequence $i'_q, \ldots, i'_{q+r-1}$
is of course a compact appearance of $P$ in $A$ starting at $i_q$. Therefore, both components of
$I'$ pairs form increasing sequences and for any $(i'_u, j_u)$, $a_{i'_u} = b_{j_u}$, so sequence
$I'$ defines an STR-IC-LCS $C'$ equal to $C$. □

A similar lemma can be formulated for the $j$-th component of sequence $I$. Thus, it is easy
to conclude that when looking for an STR-IC-LCS, instead of checking all common
subsequences of $A$ and $B$ it suffices to check only such common subsequences that contain
compact appearances of $P$ both in $A$ and $B$. (This is a direct consequence of the fact that
$LLCS(X, Y) \le LLCS(X, \alpha Y)$ for any sequence $\alpha$.)

The number of different compact appearances of $P$ in $A$ and $B$ will be denoted by $d_A$
and $d_B$, respectively. It is easy to notice that $d_A d_B \le d$, since a pair $(i, j)$ defines
a compact appearance of $P$ in $A$ starting at the $i$-th position and a compact appearance of
$P$ in $B$ starting at the $j$-th position only for some matches.

The algorithm computing an STR-IC-LCS (Fig. 1) consists of three main stages. In the
first stage, both main sequences are preprocessed to determine, for each occurrence of the
first symbol of $P$, the index of the last symbol of a compact appearance of $P$. In the second
stage, two DP matrices are computed: the forward one and the reverse one. The recurrence
is exactly as for the LCS computation.

In the last stage, the result is determined. To this end, for each match $(i, j)$ for $A$ and $B$
the ends $(i', j')$ of compact appearances of $P$ in $A$ starting at the $i$-th position and in
$B$ starting at the $j$-th position are read. The length of an STR-IC-LCS containing these
appearances of $P$ is determined as a sum of the LCS length of prefixes of $A$ and $B$ ending at
the $i$-th and $j$-th positions, respectively, the LCS length of suffixes of $A$ and $B$ starting
at the $i'$-th and $j'$-th positions, respectively, and the constraint length. Since the first
and last constraint symbols are each counted twice, the final result is decreased by 2. Using
the $F$ and $R$ matrices, backtracking can be used to obtain the subsequence, not only its length.

Lemma 2. The STR-IC-LCS algorithm (Fig. 1) correctly computes an STR-IC-LCS.

Proof. The algorithm considers all pairs of compact appearances of $P$ in $A$ and $B$. Each
such pair divides the problem into two independent LCS length-computing subproblems.
Using the precomputed $F$ and $R$ matrices it is easy to solve these subproblems (lines
18-20) in constant time. The length of an STR-IC-LCS must be the sum of the found lengths
of LCSs and the constraint length, decreased by 2. □

Lemma 3. The worst-case time complexity of the proposed algorithm is $O(mn)$.

Proof. The preprocessing stage can be done in $O((n + m)r)$ worst-case time. The main stage
consists of computation of two DP matrices, which needs $O(mn)$ time. In the final stage, the
DP matrix is traversed and for each match a constant number of operations is performed,
so this stage consumes $O(mn)$ time. Since $r \le m$, summing these up gives $O(mn)$ time. □

STR-IC-LCS(A, B, P)

{Preprocessing}
 1  for i ← 1 to n do
 2      if a_i = p_1 then M_A[i] ← smallest q such that p_1 ... p_r is a subsequence of a_i ... a_q
 3  for j ← 1 to m do
 4      if b_j = p_1 then M_B[j] ← smallest q such that p_1 ... p_r is a subsequence of b_j ... b_q

{Computation of forward and reverse DP matrices}
 5  for i ← 0 to n + 1 do F[i, 0] ← 0; R[i, m + 1] ← 0
 6  for j ← 0 to m + 1 do F[0, j] ← 0; R[n + 1, j] ← 0
 7  for i ← 1 to n do
 8      for j ← 1 to m do
 9          if a_i = b_j then F[i, j] ← F[i - 1, j - 1] + 1
10          else F[i, j] ← max(F[i - 1, j], F[i, j - 1])
11  for i ← n downto 1 do
12      for j ← m downto 1 do
13          if a_i = b_j then R[i, j] ← R[i + 1, j + 1] + 1
14          else R[i, j] ← max(R[i + 1, j], R[i, j + 1])

{Determination of the result}
15  ℓ ← 0; i* ← 0; j* ← 0
16  for i ← 1 to n do
17      for j ← 1 to m do
18          if a_i = b_j and F[i, j] + R[M_A[i], M_B[j]] + r - 2 > ℓ then
19              ℓ ← F[i, j] + R[M_A[i], M_B[j]] + r - 2
20              i* ← i; j* ← j
21  Backtrack from (i*, j*) according to F and obtain S'
22  Backtrack from (M_A[i*], M_B[j*]) according to R and obtain S''
23  return ℓ and S' p_2 p_3 ... p_{r-1} S''

Figure 1: Pseudocode of the STR-IC-LCS computing algorithm for two main sequences
and one constraining sequence
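
The pseudocode of Fig. 1 can be turned into a runnable Python sketch as shown below. This is an illustration under several assumptions rather than a reference implementation: sequences are Python strings, the names `str_ic_lcs` and `compact_ends` are hypothetical, a missing compact appearance is encoded as 0, only matches with $a_i = b_j = p_1$ for which both compact appearances exist are examined, and $(0, "")$ is returned when no solution exists. Backtracking starts one cell past the matches of $p_1$ and $p_r$, so that $P$ is inserted once as a whole; this is equivalent to line 23 and also covers $r = 1$.

```python
def str_ic_lcs(a, b, p):
    """Return (length, sequence) of an STR-IC-LCS of a and b with constraint p."""
    n, m, r = len(a), len(b), len(p)
    if r == 0:
        raise ValueError("the constraining sequence must be non-empty")

    def compact_ends(x):
        # ends[i]: 1-based end of the compact appearance of p in x starting
        # at position i; 0 if there is none (stage 1, O(|x| * r) symbol scans).
        ends = [0] * (len(x) + 1)
        for i in range(1, len(x) + 1):
            if x[i - 1] != p[0]:
                continue
            pos = i
            for sym in p[1:]:
                pos = x.find(sym, pos) + 1  # becomes 0 when sym is not found
                if pos == 0:
                    break
            ends[i] = pos
        return ends

    ma, mb = compact_ends(a), compact_ends(b)

    # Stage 2: forward (F) and reverse (R) LCS matrices with zero borders.
    F = [[0] * (m + 2) for _ in range(n + 2)]
    R = [[0] * (m + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                F[i][j] = F[i - 1][j - 1] + 1
            else:
                F[i][j] = max(F[i - 1][j], F[i][j - 1])
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            if a[i - 1] == b[j - 1]:
                R[i][j] = R[i + 1][j + 1] + 1
            else:
                R[i][j] = max(R[i + 1][j], R[i][j + 1])

    # Stage 3: combine prefix LCS, constraint, and suffix LCS.
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        if not ma[i]:
            continue
        for j in range(1, m + 1):
            if mb[j]:
                cand = F[i][j] + R[ma[i]][mb[j]] + r - 2
                if cand > best:
                    best, bi, bj = cand, i, j
    if best == 0:
        return 0, ""

    # Backtrack in F for the part before p_1.
    head = []
    i, j = bi - 1, bj - 1
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            head.append(a[i - 1])
            i, j = i - 1, j - 1
        elif F[i - 1][j] >= F[i][j - 1]:
            i -= 1
        else:
            j -= 1
    head.reverse()

    # Backtrack in R for the part after p_r.
    tail = []
    i, j = ma[bi] + 1, mb[bj] + 1
    while i <= n and j <= m:
        if a[i - 1] == b[j - 1]:
            tail.append(a[i - 1])
            i, j = i + 1, j + 1
        elif R[i + 1][j] >= R[i][j + 1]:
            i += 1
        else:
            j += 1
    return best, "".join(head) + p + "".join(tail)
```

For example, `str_ic_lcs("abcbdab", "bdcaba", "ab")` yields length 4, obtained from the match $(6, 4)$: the prefix LCS has length 3, the suffix LCS has length 1, and $r - 2 = 0$.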

Lemma 4. The space consumption of the algorithm is $O(mn)$.

4 Improvements and extensions 

If one is interested only in the STR-IC-LCS length, it is easy to notice that the $F$ and $R$
matrices can be computed row-by-row, which means that only $O(m)$ words are necessary for
them. The values of $F$ and $R$ for matches of symbols equal to $p_1$ in $F$ and $p_r$ in $R$ must,
however, be stored explicitly, so the space for them is $O(d^*)$ (where $d^* = O(mn)$ is the
number of such matches). This gives the total space $O(n + d^*)$. If the subsequence is also
requested, the cells for all matches must be stored to allow backtracking, so the space is
$O(n + d)$ (in the worst case $d = O(mn)$).
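
The length-only variant can be sketched in Python as follows. The sketch is illustrative only: the name `str_ic_lcs_length` and the dictionaries `f_at` and `r_at` are assumptions of this example, and the return value 0 is used here to signal that no solution exists. Only two rows of each matrix are kept, and cells are stored explicitly only at starts and ends of compact appearances.

```python
def str_ic_lcs_length(a, b, p):
    """Length of an STR-IC-LCS of a and b with constraint p (0 if none)."""
    n, m, r = len(a), len(b), len(p)
    if r == 0:
        raise ValueError("the constraining sequence must be non-empty")

    def compact_ends(x):
        # Map each start of a compact appearance of p in x to its end (1-based).
        ends = {}
        for i in range(1, len(x) + 1):
            if x[i - 1] != p[0]:
                continue
            pos = i
            for sym in p[1:]:
                pos = x.find(sym, pos) + 1  # becomes 0 when sym is not found
                if pos == 0:
                    break
            if pos:
                ends[i] = pos
        return ends

    ends_a, ends_b = compact_ends(a), compact_ends(b)
    if not ends_a or not ends_b:
        return 0

    # Forward pass: keep the previous and current row only,
    # remember F at pairs of starts of compact appearances.
    f_at = {}
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        if i in ends_a:
            for j in ends_b:
                f_at[i, j] = cur[j]
        prev = cur

    # Reverse pass: same idea, remember R at pairs of ends.
    end_rows, end_cols = set(ends_a.values()), set(ends_b.values())
    r_at = {}
    nxt = [0] * (m + 2)
    for i in range(n, 0, -1):
        cur = [0] * (m + 2)
        for j in range(m, 0, -1):
            if a[i - 1] == b[j - 1]:
                cur[j] = nxt[j + 1] + 1
            else:
                cur[j] = max(nxt[j], cur[j + 1])
        if i in end_rows:
            for j in end_cols:
                r_at[i, j] = cur[j]
        nxt = cur

    return max(
        f_at[i, j] + r_at[ends_a[i], ends_b[j]] + r - 2
        for i in ends_a
        for j in ends_b
    )
```

The two dictionaries hold at most one entry per match of $p_1$ and per match of $p_r$, respectively, which corresponds to the $O(d^*)$ term above.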

As the only cells that are necessary to be stored explicitly are those for matches, the
Hunt-Szymanski method [7] can be used to speed up the computation of the $F$ and $R$ matrices
if the number of matches is small. Therefore, the second stage can be completed in
$O(d \log\log m + n)$ time if $\sigma = O(n)$ and $O(d \log\log n + n \log n)$ otherwise. The
time complexity of the final stage is $O(d)$. Adding the time for the preprocessing we obtain
the worst-case time complexities: $O(d \log\log m + nr)$ for $\sigma = O(n)$ and
$O(d \log\log n + n(r + \log n))$ otherwise.
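
The idea of processing only matches can be illustrated by a minimal Python sketch of the classic Hunt-Szymanski scheme for the LCS length. It is a simplification and not the variant analysed above: it uses plain binary search over a threshold array, which gives $O(d \log m + n + m)$ time rather than the $\log\log$ bounds, and it returns only $LLCS(A, B)$ instead of the per-match values of $F$ and $R$. The function name is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def llcs_hunt_szymanski(a, b):
    """LCS length of a and b, visiting only the matches (i, j)."""
    # occ[sym]: increasing list of 0-based positions of sym in b.
    occ = defaultdict(list)
    for j, sym in enumerate(b):
        occ[sym].append(j)

    # thresh[k]: smallest position in b at which a common subsequence of
    # length k + 1 of the processed prefix of a and b can end.
    thresh = []
    for sym in a:
        # Decreasing order of j prevents two matches from the same row
        # of the DP matrix from extending each other.
        for j in reversed(occ.get(sym, ())):
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh)


if __name__ == "__main__":
    print(llcs_hunt_szymanski("abcbdab", "bdcaba"))  # 4
```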

The generalization of the LCS problem for many sequences is direct, but the time
complexity of the exact algorithm computing the multidimensional DP matrix is
$O(2^z n^z)$, where $z$ is the number of sequences of length $O(n)$ each [7]. It is easy to
notice that according to Lemma 1 the STR-IC-LCS problem generalizes in the same way and the
worst-case time complexity is also $O(2^z n^z)$.
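
For illustration, the multidimensional LCS recurrence can be sketched in Python with memoized recursion over index vectors. This is only a sketch for small inputs with a hypothetical function name: it covers the plain LCS length for $z$ sequences, not the STR-IC-LCS extension, and it uses the simplified recurrence with $z$ predecessors per cell, so its constant factors differ from the bound quoted above.

```python
from functools import lru_cache


def llcs_many(seqs):
    """LCS length of all sequences in seqs (exponential in their number)."""
    seqs = tuple(seqs)
    if not seqs:
        return 0

    @lru_cache(maxsize=None)
    def go(idx):
        # idx[k] is the length of the considered prefix of seqs[k].
        if any(i == 0 for i in idx):
            return 0
        last = {s[i - 1] for s, i in zip(seqs, idx)}
        if len(last) == 1:
            # All prefixes end with the same symbol: take it.
            return go(tuple(i - 1 for i in idx)) + 1
        # Otherwise some prefix does not end with the last LCS symbol,
        # so shortening one prefix by one symbol loses nothing.
        return max(
            go(idx[:k] + (idx[k] - 1,) + idx[k + 1:]) for k in range(len(idx))
        )

    return go(tuple(len(s) for s in seqs))


if __name__ == "__main__":
    print(llcs_many(["abcbdab", "bdcaba", "bcab"]))  # 4
```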

5 Conclusions 

We investigated the recently introduced STR-IC-LCS problem. The fastest algorithms solving
this problem known to date needed cubic time in the case of two main sequences and one
constraining sequence. Our algorithm is faster, as its time complexity is only quadratic.
Moreover, the algorithm uses an LCS-computation procedure as a component, so any progress in
LCS computation can improve the time complexity of the proposed method.

We also showed an irrecoverable flaw in [6], in which an algorithm of better than cubic
time complexity was recently proposed, i.e., we proved this algorithm is supercubic in the
worst case.